x86/HVM: don't crash guest upon problems occurring in user mode

This extends commit 5283b310 ("x86/HVM: only kill guest when unknown VM
exit occurred in guest kernel mode") to a few more cases, including the
failed VM entry one that XSA-110 was needed to be issued for.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

x86: don't ignore foreigndom input on various MMUEXT ops

Instead properly fail requests that shouldn't be issued on foreign
domains or - for MMUEXT_{CLEAR,COPY}_PAGE - extend the existing
operation to work that way.

In the course of doing this the need to always clear "okay" even when
wanting an error code other than -EINVAL became unwieldy, so the
respective logic is being adjusted at once, together with a little
other related cleanup.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>
Release-Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

x86: tighten page table owner checking in do_mmu_update()

MMU_MACHPHYS_UPDATE, not manipulating page tables, shouldn't ignore
a bad page table domain being specified.

Also pt_owner can't be NULL when reaching the "out" label, so the
respective check can be dropped.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>
Release-Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

x86/cpuidle: don't count C1 multiple times

Commit 4ca6f9f0 ("x86/cpuidle: publish new states only after fully
initializing them") resulted in the state counter being incremented
for C1 despite that state using a fixed table entry (and the statically
initialized counter value already accounting for it and C0).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Release-Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

dpci: add 'masked' as a gate for hvm_dirq_assist to process

Commit f6dd295381f4b6a66acddacf46bca8940586c8d8 "dpci: replace tasklet
with softirq" used 'masked' as a two-bit state mechanism
(STATE_SCHED, STATE_RUN) to communicate between 'raise_softirq_for' and
'dpci_softirq' to determine whether the 'struct hvm_pirq_dpci' can be
re-scheduled.

However it ignored that 'pt_irq_guest_eoi' was not adhering to the proper
dialogue: it was not using locked cmpxchg or test_bit operations and
ended up setting 'state' to zero. That meant 'raise_softirq_for' was
free to schedule it while the 'struct hvm_pirq_dpci' was still on a
per-cpu list, causing list corruption.

The code would trigger the following path causing list corruption:

    \-timer_softirq_action
     pt_irq_time_out calls pt_pirq_softirq_cancel sets state to 0.
            pirq_dpci is still on dpci_list.
    \- dpci_softirq
     while (!list_empty(&our_list))
     list_del, but has not yet done 'entry->next = LIST_POISON1;'
    [interrupt happens]
     raise_softirq checks state which is zero. Adds pirq_dpci to the dpci_list.
    [interrupt is done, back to dpci_softirq]
     finishes the entry->next = LIST_POISON1;
     .. test STATE_SCHED returns true, so executes the hvm_dirq_assist.
     ends the loop, exits.

    \- dpci_softirq
     while (!list_empty)
     list_del, but ->next already has LIST_POISON1 and we blow up.

An alternative solution was proposed (adding STATE_ZOMBIE and making
pt_irq_time_out use the cmpxchg protocol on 'state'), which fixed the above
issue but had a fatal bug. It would miss interrupts that are to be scheduled!

This patch brings back the 'masked' boolean which is used as a
communication channel between 'hvm_do_IRQ_dpci', 'hvm_dirq_assist' and
'pt_irq_guest_eoi'. When we have an interrupt we set 'masked'. Anytime
'hvm_dirq_assist' or 'pt_irq_guest_eoi' executes - it clears it.

The 'state' is left as a separate mechanism for 'raise_softirq_for' and
'dpci_softirq' to communicate the state of the
'struct hvm_pirq_dpci'.

However since we now have two separate mechanisms we have to deal with
cancellation while an outstanding interrupt is being serviced:
'pt_irq_destroy_bind' is called while an 'hvm_dirq_assist' is just about
to service.
The 'pt_irq_destroy_bind' takes the lock first and kills the timer - and
the moment it releases the spinlock, 'hvm_dirq_assist' thunders in and calls
'set_timer' hitting an ASSERT.

By clearing 'masked' in 'pt_irq_destroy_bind' we take care of that
scenario by preventing 'hvm_dirq_assist' from calling 'set_timer'.

In the 'pt_irq_create_bind' - in the error cases we could be seeing
a softirq scheduled right away and being serviced (though stuck at
the spinlock).  The 'pt_irq_create_bind' fails in 'pt_pirq_softirq_reset'
to change the 'state' (as the state is in 'STATE_RUN', not 'STATE_SCHED').
'pt_irq_create_bind' continues on with setting '->flag=0' and unlocks the lock.

'hvm_dirq_assist' grabs the lock and continues on. Since 'flag = 0' and
'digl_list' is empty, it thunders through the 'hvm_dirq_assist' not doing
anything until it hits 'set_timer' which is undefined for MSI. Adding
in 'masked=0' for the MSI case fixes that.

The legacy interrupt one does not need it as there is no chance of
do_IRQ being called at that point.

Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

x86/mm: fix a reference counting error in MMU_MACHPHYS_UPDATE

Any domain which can pass the XSM check against a translated guest can cause a
page reference to be leaked.

While shuffling the order of checks, drop the quite-pointless MEM_LOG(). This
brings the check in line with similar checks in the vicinity.

Discovered while reviewing the XSA-109/110 followup series.

This is XSA-113.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>

docs/commandline: Fix formatting issues

For 'dom0_max_vcpus' and 'hvm_debug', markdown was interpreting the text as
regular text, and reflowing it as a regular paragraph, leading to a single
line as output. Reformat them as code blocks inside blockquote blocks, which
causes them to take their precise whitespace layout.

For 'psr', the bullet point was incorrectly delineated from paragraph text,
causing it to be reflowed. Alter the formatting to include the CMT-specific
options as sub-bullets of the overall CMT resource.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Campbell <Ian.Campbell@citrix.com>
Release-acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>

xen: arm: correct specific mappings for PCIE0 on X-Gene

The region assigned to PCIE0, according to the docs, is 0x0e000000000 to
0x10000000000. They make no distinction between PCI CFG and PCI IO mem within
this range (in fact, I'm not sure that isn't up to the driver).

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Julien Grall <julien.grall@linaro.org>

xen: arm: correct off by one in xgene-storm's map_one_mmio

The callers pass the end as the pfn immediately *after* the last page to be
mapped, therefore adding one is incorrect and causes an additional page to be
mapped.
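A minimal sketch of the exclusive-end convention described above. The helper
'pages_to_map' is hypothetical and not part of the Xen tree; it only shows
that with 'end' being the frame immediately after the last page, the count is
'end - start' with no extra one added.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of pages in the half-open range
 * [start, end), where end is the pfn immediately after the last page
 * to be mapped. */
static uint64_t pages_to_map(uint64_t start, uint64_t end)
{
    assert(end >= start);
    return end - start;        /* not end - start + 1 */
}
```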

At the same time correct the printing of the mfn values, zero-padding them to
16 digits as for a paddr when they are frame numbers is just confusing.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Julien Grall <julien.grall@linaro.org>

xen: arm: Drop EARLY_PRINTK_BAUD from entries which don't set ..._INIT_UART

EARLY_PRINTK_BAUD doesn't do anything unless EARLY_PRINTK_INIT_UART is set.

Furthermore only the pl011 driver implements the init routine at all, so the
entries which use any other UART driver and specified a BAUD were doubly wrong.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Julien Grall <julien.grall@linaro.org>

xen: arm: Add earlyprintk for McDivitt.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Julien Grall <julien.grall@linaro.org>

docs: workaround markdown parser error in xen-command-line.markdown

Some versions of markdown (specifically the one in Debian Wheezy, currently
used to generate
http://xenbits.xen.org/docs/unstable/misc/xen-command-line.html) seem to be
confused by nested lists in the middle of multi-paragraph parent list entries
as seen in the com1,com2 entry.

The effect is that the "Default" section of all following entries is replaced
by some sort of hash or checksum (at least, a string of 32 random-seeming hex
digits).

Work around this issue by making the descriptions of the DPS options a nested
list, moving the existing nested list describing the options for S into a third
level list. This seems to avoid the issue, and is arguably better formatting in
its own right (at least it's not a regression IMHO)

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

mkdeb: correctly map package architectures for x86 and ARM

mkdeb previously set the package architecture to be 'amd64' for anything other than
XEN_TARGET_ARCH=x86_32. This patch attempts to correctly map the architecture
from XEN_TARGET_ARCH to the Debian architecture names for x86 and ARM
architectures.

Signed-off-by: Clark Laughlin <clark.laughlin@linaro.org>
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Ian Jackson <Ian.Jackson@eu.citrix.com>

libxl: Document device parameter of libxl_device_<type>_add functions

The device parameter of libxl_device_<type>_add is an in/out parameter.
Unspecified fields are filled in with appropriate values for the created
device when the function returns. Document this behaviour.

Signed-off-by: Euan Harris <euan.harris@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

libxl: remove existence check for PCI device hotplug

The existence check is to make sure a device is not added to a guest
multiple times.

PCI device backend path has different rules from vif, disk etc. For
example:
/local/domain/0/backend/pci/9/0/dev-1/0000:03:10.1
/local/domain/0/backend/pci/9/0/key-1/0000:03:10.1
/local/domain/0/backend/pci/9/0/dev-2/0000:03:10.2
/local/domain/0/backend/pci/9/0/key-2/0000:03:10.2

The devid for PCI devices is hardcoded 0. libxl__device_exists only
checks up to /local/.../9/0 so it always returns true even if the device is
assignable.

Remove invocation of libxl__device_exists. We're sure at this point that
the PCI device is assignable (hence no xenstore entry or JSON entry).
The check is done beforehand. For HVM guest it's done by calling
xc_test_assign_device and for PV guest it's done by calling
pciback_dev_is_assigned.

Reported-by: Li, Liang Z <liang.z.li@intel.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Konrad Wilk <konrad.wilk@oracle.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

libxl: CODING_STYLE: Discuss existing style problems

Document that:
- the existing code is not all conforming yet
- code should conform
- we will sometimes accept patches with nonconforming elements if
they don't make matters worse.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

libxl: CODING_STYLE: Mention function out parameters

We seem to use both `_r' and `_out'. Document both.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

libxl: CODING_STYLE: Deprecate `error' for out blocks

We should have only one name for this and `out' is more prevalent.
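A generic sketch of a function with a single exit block labelled `out'. It is
an illustration only: 'duplicate_name' is a hypothetical function, and it uses
plain malloc/free rather than any libxl allocation helper.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical function with one exit path labelled `out'.
 * Returns 0 on success, -1 on failure; *copy_out is set only on success. */
static int duplicate_name(const char *name, char **copy_out)
{
    int rc;
    char *copy = NULL;

    assert(copy_out != NULL);

    if (!name) { rc = -1; goto out; }

    copy = malloc(strlen(name) + 1);
    if (!copy) { rc = -1; goto out; }
    strcpy(copy, name);

    *copy_out = copy;
    copy = NULL;               /* ownership passed to the caller */
    rc = 0;

 out:
    free(copy);                /* free(NULL) is a no-op */
    return rc;
}
```

Every failure path jumps to the same label, so cleanup is written once.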

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

libxl: CODING_STYLE: Much new material

Discuss:

    Memory allocation
    Conventional variable names
    Convenience macros
    Error handling
    Idempotent data structure construction/destruction
    Asynchronous/long-running operations

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

x86emul: enforce privilege level restrictions when loading CS

Privilege level checks were basically missing for the CS case, the
only check that was done (RPL == DPL for nonconforming segments)
was solely covering a single special case (return to non-conforming
segment).

Additionally in long mode the L bit set requires the D bit to be clear,
as was recently pointed out for KVM by Nadav Amit
<namit@cs.technion.ac.il>.

Finally we also need to force the loaded selector's RPL to CPL (at
least as long as lret/retf emulation doesn't support privilege level
changes).

This is CVE-2014-8595 / XSA-110.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>

x86: don't allow page table updates on non-PV page tables in do_mmu_update()

paging_write_guest_entry() and paging_cmpxchg_guest_entry() aren't
consistently supported for non-PV guests (they'd deref NULL for PVH or
non-HAP HVM ones). Don't allow respective MMU_* operations on the
page tables of such domains.

This is CVE-2014-8594 / XSA-109.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>

EFI: allow retry of ExitBootServices() call

The specification is kind of vague under what conditions
ExitBootServices() may legitimately fail, requiring the OS loader to
retry:

"If MapKey value is incorrect, ExitBootServices() returns
EFI_INVALID_PARAMETER and GetMemoryMap() with ExitBootServices() must
be called again. Firmware implementation may choose to do a partial
shutdown of the boot services during the first call to
ExitBootServices(). EFI OS loader should not make calls to any boot
service function other then GetMemoryMap() after the first call to
ExitBootServices()."

While our code guarantees the map key to be valid, there are systems
where a firmware internal notification sent while processing
ExitBootServices() reportedly results in changes to the memory map.
In that case, make a best effort second try: Avoid any boot service
calls other than the two named above, with the possible exception of
error paths. Those aren't a problem, since if we end up needing to
retry, we're hosed when something goes wrong as much as if we didn't
make the retry attempt.

For x86, a minimal adjustment to efi_arch_process_memory_map() is
needed for it to cope with potentially being called a second time.

For arm64, while efi_process_memory_map_bootinfo() is easy to verify
that it can safely be called more than once without violating spec
constraints, it's not so obvious for fdt_add_uefi_nodes(), hence a
step by step approach:
- deletion of memory nodes and memory reserve map entries: the 2nd pass
shouldn't find any as the 1st one deleted them all,
- a "chosen" node should be found as it got added in the 1st pass,
- the various "linux,uefi-*" nodes all got added during the 1st pass
and hence only their contents may get updated.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roy Franz <roy.franz@linaro.org>
Release-acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

x86: (allow to) override LIST_POISON*

Having these point into space not controlled by the hypervisor provides
an unnecessary attack surface. Allow architectures to override them and
utilize that override to make them non-canonical addresses (thus
causing #GP rather than #PF when dereferenced).
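What "non-canonical" means can be sketched as follows, assuming 48-bit
virtual addresses. 'EXAMPLE_POISON' is a value chosen for illustration and is
not claimed to be the constant used by the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* With 48-bit virtual addresses, an address is canonical when bits
 * 63..47 are all zero or all one. */
static bool is_canonical_48(uint64_t addr)
{
    uint64_t top = addr >> 47;   /* the upper 17 bits */

    return top == 0 || top == 0x1ffff;
}

/* Hypothetical poison value, for illustration only. */
#define EXAMPLE_POISON UINT64_C(0x0100100100100100)

/* Self-check: the example poison value must not be a canonical address. */
static void example_check_poison(void)
{
    assert(!is_canonical_48(EXAMPLE_POISON));
}
```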

Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

libxl: add missing action in DEFINE_DEVICE_ADD

... otherwise when device add operation fails, the error message looks
like "libxl: error: libxl.c:1897:device_addrm_aocomplete: unable to (null)
device", which is not very helpful.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

vTPM: Fix Atmel timeout bug.

Some versions of Atmel TPMs provide invalid values for TPM_CAP_PROP_TIS_TIMEOUT query.
Because timeouts are invalid, every other command after tpm_get_timeouts will fail.
It is a known issue and it was fixed recently in linux kernel tpm_tis.c on 2014-07-29.
This patch does not allow timeouts to be less than standard values.
I tested it on a Dell Latitude E5520 and after making the changes I was able to start vtpmmgr-stubdom.
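The clamping idea can be sketched as below. The minimum values and the
millisecond unit are assumptions made for this illustration and should be
checked against the TIS specification; 'example_fix_timeouts' is a
hypothetical name, not the function touched by the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed minimum timeouts A, B, C, D in milliseconds (illustrative). */
static const unsigned int example_min_timeout_ms[4] = { 750, 2000, 750, 750 };

/* Raise any reported timeout that is below its minimum to that minimum. */
static void example_fix_timeouts(unsigned int reported_ms[4])
{
    int i;

    assert(reported_ms != NULL);
    for (i = 0; i < 4; i++)
        if (reported_ms[i] < example_min_timeout_ms[i])
            reported_ms[i] = example_min_timeout_ms[i];
}
```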

Signed-off-by: Emil Condrea <emilcondrea@gmail.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>

xl: correct test condition on libxl_domain_info

The `if' statement considered return value 0 from libxl_domain_info an
error, while 0 actually means success.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

tools/hotplug: use configure --sysconfdir result

... instead of hardcoding values and guessing where the config files may
be. Also use the result of --with-sysconfig-leaf-dir.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

adjust number of domains in cpupools when destroying domain

Commit bac6334b51d9bcfe57ecf4a4cb5288348fcf044a (move domain to
cpupool0 before destroying it) introduced an error in the accounting
of cpupools regarding the number of domains. The number of domains
is not adjusted when a domain is moved to cpupool0 in kill_domain().

Correct this by introducing a cpupool function doing the move
instead of open coding it by calling sched_move_domain().

Reported-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Tested-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
Reviewed-by: Andrew Cooper <Andrew.Cooper3@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

dpci: replace tasklet with softirq

The existing tasklet mechanism has a single global spinlock that is
taken every time the global list is touched. And we use this lock quite
a lot - when we call do_tasklet_work which is called via a softirq and
from the idle loop. We take the lock on any operation on the
tasklet_list.

The problem we are facing is that there are quite a lot of tasklets
scheduled. The most common one that is invoked is the one injecting the
VIRQ_TIMER in the guest. Guests are not insane and don't set the
one-shot or periodic clocks to be in sub 1ms intervals (causing said
tasklet to be scheduled for such small intervals).

The problem appears when PCI passthrough devices are used over many
sockets and we have a mix of heavy-interrupt guests and idle guests.
The idle guests end up seeing 1/10 of their RUNNING timeslice eaten by
the hypervisor (and 40% steal time).

The mechanism by which we inject PCI interrupts is by hvm_do_IRQ_dpci
which schedules the hvm_dirq_assist tasklet every time an interrupt is
received. The callchain is:

_asm_vmexit_handler
-> vmx_vmexit_handler
    ->vmx_do_extint
        -> do_IRQ
            -> __do_IRQ_guest
                -> hvm_do_IRQ_dpci
                   tasklet_schedule(&dpci->dirq_tasklet);
                   [takes lock to put the tasklet on]

[later on the schedule_tail is invoked which is 'vmx_do_resume']

vmx_do_resume
-> vmx_asm_do_vmentry
        -> call vmx_intr_assist
          -> vmx_process_softirqs
            -> do_softirq
              [executes the tasklet function, takes the
               lock again]

While on other CPUs they might be sitting in an idle loop and invoked to
deliver a VIRQ_TIMER, which also ends up taking the lock twice: first
to schedule the v->arch.hvm_vcpu.assert_evtchn_irq_tasklet (accounted
to the guests' BLOCKED_state); then to execute it - which is accounted
for in the guest's RUNTIME_state.

The end result is that on an 8 socket machine with PCI passthrough,
where four sockets are busy with interrupts, and the other sockets have
idle guests - we end up with the idle guests having around 40% steal
time and 1/10 of their timeslice (3ms out of 30 ms) being tied up taking
the lock. The latency of the PCI interrupts delivered to the guest is
also hindered.

With this patch the problem disappears completely. That is achieved by
removing the lock for the PCI passthrough use-case (the
'hvm_dirq_assist' case) by not using tasklets at all.

The patch is simple - instead of scheduling a tasklet we schedule our
own softirq - HVM_DPCI_SOFTIRQ, which will take care of running
'hvm_dirq_assist'. The information we need on each CPU is which
'struct hvm_pirq_dpci' structure the 'hvm_dirq_assist' needs to run on.
That is simply solved by threading the 'struct hvm_pirq_dpci' through a
linked list. The rule of only running one 'hvm_dirq_assist' for only
one 'hvm_pirq_dpci' is also preserved by having 'schedule_dpci_for'
ignore any subsequent calls for a domain which has already been
scheduled.
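The "ignore subsequent calls while already scheduled" rule can be sketched
with an atomic flag. This is a simplified model using C11 atomics:
'struct example_dpci' and the 'example_*' functions are hypothetical, the
per-CPU list is reduced to a counter, and the real STATE_SCHED/STATE_RUN
protocol is more involved than what is shown here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum { STATE_SCHED = 1, STATE_RUN = 2 };

/* Hypothetical stand-in for 'struct hvm_pirq_dpci'. */
struct example_dpci {
    atomic_uint state;
    unsigned int queued;   /* how many times it sits on the list */
};

/* Returns true if this call queued the entry, false if it was already
 * scheduled and the call was ignored. */
static bool example_schedule(struct example_dpci *d)
{
    unsigned int old = atomic_fetch_or(&d->state, STATE_SCHED);

    if (old & STATE_SCHED)
        return false;
    d->queued++;           /* real code would add it to a per-CPU list */
    return true;
}

/* Softirq side: take the entry off the list, run, then clear the state. */
static void example_softirq(struct example_dpci *d)
{
    atomic_fetch_or(&d->state, STATE_RUN);
    assert(d->queued > 0);
    d->queued--;
    atomic_store(&d->state, 0);
}
```

Because the test and the set of the scheduled bit happen in one atomic
operation, two racing callers cannot both queue the same entry.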

== Code details ==

Most of the code complexity comes from the '->dom' field in the
'hvm_pirq_dpci' structure. We use it for ref-counting and as such it
MUST be valid as long as STATE_SCHED bit is set. Whoever clears the
STATE_SCHED bit does the ref-counting and can also reset the '->dom'
field.

To compound the complexity, there are multiple points where the
'hvm_pirq_dpci' structure is reset or re-used. Initially (first time
the domain uses the pirq), the 'hvm_pirq_dpci->dom' field is set to
NULL as it is allocated. On subsequent calls in to 'pt_irq_create_bind'
the ->dom is whatever it had last time.

As this is the initial call (which QEMU ends up calling when the guest
writes a vector value in the MSI field) we MUST set the '->dom' to
the proper structure (otherwise we cannot do proper ref-counting).

The mechanism to tear it down is more complex as there are three ways
it can be executed. To make it simpler everything revolves around
'pt_pirq_softirq_active'. If it returns -EAGAIN that means there is an
outstanding softirq that needs to finish running before we can continue
tearing down. With that in mind:

a) pci_clean_dpci_irq. This gets called when the guest is being
   destroyed. We end up calling 'pt_pirq_softirq_active' to see if it
   is OK to continue the destruction.

   The scenarios in which the 'struct pirq' (and subsequently the
   'hvm_pirq_dpci') gets destroyed is when:

   - guest did not use the pirq at all after setup.
   - guest did use pirq, but decided to mask and left it in that state.
   - guest did use pirq, but crashed.

   In all of those scenarios we end up calling 'pt_pirq_softirq_active'
   to check if the softirq is still active. Read below on the
   'pt_pirq_softirq_active' loop.

b) pt_irq_destroy_bind (guest disables the MSI). We double-check that
   the softirq has run by piggy-backing on the existing
   'pirq_cleanup_check' mechanism which calls 'pt_pirq_cleanup_check'.
   We add the extra call to 'pt_pirq_softirq_active' in
   'pt_pirq_cleanup_check'.

   NOTE: Guests that use event channels unbind first the event channel
   from PIRQs, so the 'pt_pirq_cleanup_check' won't be called as 'event'
   is set to zero. In that case we either clean it up via the a) or c)
   mechanism.

   There is an extra scenario regardless of 'event' being set or not:
   the guest did 'pt_irq_destroy_bind' while an interrupt was triggered
   and softirq was scheduled (but had not been run). It is OK to still
   run the softirq as hvm_dirq_assist won't do anything (as the flags
   are set to zero). However we will try to deschedule the softirq if
   we can (by clearing the STATE_SCHED bit and us doing the
   ref-counting).

c) pt_irq_create_bind (not a typo). The scenarios are:

   - guest disables the MSI and then enables it (rmmod and modprobe in
     a loop). We call 'pt_pirq_reset' which checks to see if the
     softirq has been scheduled. Imagine the 'b)' with interrupts in
     flight and c) getting called in a loop.

We will spin up on 'pt_pirq_is_active' (at the start of the
'pt_irq_create_bind') with the event_lock spinlock dropped and waiting
(cpu_relax). We cannot call 'process_pending_softirqs' as it might
result in a dead-lock. hvm_dirq_assist will be executed and then the
softirq will clear 'state' which signals that we can re-use the
'hvm_pirq_dpci' structure. In case this softirq is scheduled on a
remote CPU the softirq will run on it as the semantics behind a
softirq is that it will execute within the guest interruption.

   - we hit one of the error paths in 'pt_irq_create_bind' while an
     interrupt was triggered and softirq was scheduled.

If the softirq is in STATE_RUN that means it is executing and we should
let it continue on. We can clear the '->dom' field as the softirq has
stashed it beforehand. If the softirq is STATE_SCHED and we are
successful in clearing it, we do the ref-counting and clear the '->dom'
field. Otherwise we let the softirq continue on and the '->dom' field
is left intact. The clearing of the '->dom' is left to a), b) or again
c) case.

Note that in both cases the 'flags' variable is cleared so
hvm_dirq_assist won't actually do anything.

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

dpci: move from an hvm_irq_dpci (and struct domain) to an hvm_dirq_dpci model

When an interrupt for a PCI (or PCIe) passthrough device is to be sent
to a guest, we find the appropriate 'hvm_dirq_dpci' structure for the
interrupt (PIRQ), set a bit (masked), and schedule a tasklet.

Then the 'hvm_dirq_assist' tasklet gets called with the 'struct domain'
from where it iterates over the radix-tree of 'hvm_dirq_dpci' (from
zero to the number of PIRQs allocated) which are masked to the guest
and calls each 'hvm_pirq_assist'. If the PIRQ has a bit set (masked) it
figures out how to inject the PIRQ to the guest.

This is inefficient and not fair as:
- We iterate starting at PIRQ 0 and up every time. That means the PCIe
   devices that have lower PIRQs get to be called first.
- If we have many PCIe devices passed in with many PIRQs and if most
   of the time only the highest numbered PIRQ gets an interrupt (as the
   initial ones are for control) we end up iterating over many PIRQs.

But we could do better - the 'hvm_dirq_dpci' has the field for
'struct domain', which we can use instead of having to pass in the
'struct domain'.

As such this patch moves the tasklet to the 'struct hvm_dirq_dpci' and
sets the 'dom' field to the domain. We also double-check that the
'->dom' is not reset before using it.

We have to be careful with this as that means we MUST have 'dom' set
before pirq_guest_bind() is called. As such we add the
'pirq_dpci->dom = d;' to cover for such cases.

The mechanism to tear it down is more complex as there are two ways it
can be executed:

a) pci_clean_dpci_irq. This gets called when the guest is being
    destroyed. We end up calling 'tasklet_kill'.

    The scenarios in which the 'struct pirq' (and subsequently the
    'hvm_pirq_dpci') gets destroyed is when:

    - guest did not use the pirq at all after setup.
    - guest did use pirq, but decided to mask and left it in that
      state.
    - guest did use pirq, but crashed.

    In all of those scenarios we end up calling 'tasklet_kill' which
    will spin on the tasklet if it is running.

b) pt_irq_destroy_bind (guest disables the MSI). We double-check that
    the softirq has run by piggy-backing on the existing
    'pirq_cleanup_check' mechanism which calls 'pt_pirq_cleanup_check'.
    We add the extra call to 'pt_pirq_softirq_active' in
    'pt_pirq_cleanup_check'.

    NOTE: Guests that use event channels unbind first the event channel
    from PIRQs, so the 'pt_pirq_cleanup_check' won't be called as event
    is set to zero. In that case we clean it up via the a)
    mechanism. It is OK to re-use the tasklet when 'pt_irq_create_bind'
    is called afterwards.

    There is an extra scenario regardless of event being set or not:
    the guest did 'pt_irq_destroy_bind' while an interrupt was
    triggered and tasklet was scheduled (but had not been run). It is
    OK to still run the tasklet as hvm_dirq_assist won't do anything
    (as the flags are set to zero). As such we can exit out of
    hvm_dirq_assist without doing anything.

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

sched_rt: serialize vcpu data access

Fix the following two issues in rtds scheduler:
1) The runq queue lock is not grabbed when rt_update_deadline is
called in rt_alloc_vdata function, which may cause a race condition;
Solution: Move call to rt_update_deadline from _alloc to _insert;
Note: rt_alloc_vdata does not need to grab the runq lock, because only one
cpu will allocate the rt_vcpu; before the rt_vcpu is inserted into the
runq, no more than one cpu operates on the rt_vcpu.

2) rt_vcpu_remove should grab the runq lock before removing the vcpu
from runq; otherwise, a race condition may happen.
Solution: Add lock in rt_vcpu_remove().

Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
Release-Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

sched_rt: sanity check input and serialization

Sanity check input params in rt_dom_cntl();
Serialize rt_dom_cntl() against the global lock.

Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
Release-Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

fix commit xen/arm: Add support for GICv3 for domU

The build of xen-4.5.0-rc2 fails if XSM_ENABLE=y due to an inconsistency
in commit fda1614 "xen/arm: Add support for GICv3 for domU" which uses
XEN_DOMCTL_configure_domain in xen/xsm/flask/hooks.c and
xen/xsm/flask/policy/access_vectors but XEN_DOMCTL_arm_configure_domain
elsewhere.

Michael Young
In fda1614 ("xen/arm: Add support for GICv3 for domU")
XEN_DOMCTL_configure_domain is used in xen/xsm/flask/hooks.c and
xen/xsm/flask/policy/access_vectors but XEN_DOMCTL_arm_configure_domain
is used elsewhere.

Signed-off-by: Michael Young <m.a.young@durham.ac.uk>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

Xen 4.5.0-rc2: Update tag for QEMU upstream tree....

QEMU traditional can stay at rc1 since there are no changes in it.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

pvgrub: ignore NUL

When using pvgrub in graphical mode with vnc, the grub timeout doesn't
work: the countdown doesn't even start. With a serial terminal the
problem doesn't occur and the countdown works as expected.

It turns out that the problem is that when using a graphical terminal,
checkkey () returns 0 instead of -1 when there is no activity on the
mouse or keyboard. As a consequence grub thinks that the user typed
something and interrupts the count down.

To fix the issue simply ignore keystrokes returning 0, that is the NUL
character anyway. Add a patch to grub.patches to do that.
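The decision described above can be sketched as a small predicate. The
function name is hypothetical and this is not the patch added to
grub.patches; it only shows that both -1 (no input) and 0 (NUL) are treated
as "nothing was typed".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical predicate: should a polled key value stop the countdown?
 * -1 means no key was available; 0 is the NUL character and is ignored. */
static bool key_interrupts_countdown(int key)
{
    assert(key >= -1);
    return key != -1 && key != 0;
}
```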

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Tested-by: Steven Haigh <netwiz@crc.id.au>
Acked-by: Samuel Thibault <samuel.thibault@ens-lyon.org>

xen/arm: Add support for GICv3 for domU

The vGIC will emulate the same version as the hardware. The toolstack has
to retrieve the version of the vGIC in order to be able to create the
corresponding device tree node.

A new DOMCTL has been introduced for ARM to configure the domain. For now
it only allows the toolstack to retrieve the version of the vGIC.
This DOMCTL will be extended later to let the user choose the version of the
emulated GIC.

Signed-off-by: Vijaya Kumar K <Vijaya.Kumar@caviumnetworks.com>
Signed-off-by: Julien Grall <julien.grall@linaro.org>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Cc: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

tools: libxl: do not overrun input buffer in libxl__parse_mac

Valgrind reports:
==7971== Invalid read of size 1
==7971==    at 0x40877BE: libxl__parse_mac (libxl_internal.c:288)
==7971==    by 0x405C5F8: libxl__device_nic_from_xs_be (libxl.c:3405)
==7971==    by 0x4065542: libxl__append_nic_list_of_type (libxl.c:3484)
==7971==    by 0x4065542: libxl_device_nic_list (libxl.c:3504)
==7971==    by 0x406F561: libxl_retrieve_domain_configuration (libxl.c:6661)
==7971==    by 0x805671C: reload_domain_config (xl_cmdimpl.c:2037)
==7971==    by 0x8057F30: handle_domain_death (xl_cmdimpl.c:2116)
==7971==    by 0x8057F30: create_domain (xl_cmdimpl.c:2580)
==7971==    by 0x805B4B2: main_create (xl_cmdimpl.c:4652)
==7971==    by 0x804EAB2: main (xl.c:378)

This is because on the final iteration the tok += 3 skips over the terminating
NUL to the next byte, and then *tok reads it. Fix this by using endptr as the
iterator.
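
The iteration scheme can be sketched as follows. This is an illustrative
Python model rather than the libxl code: parse_mac here is a hypothetical
helper, and the position index plays the role of endptr, so the next read
always starts where the previous parse stopped instead of at a fixed
offset of 3.

```python
import string


def parse_mac(text):
    """Parse 'aa:bb:cc:dd:ee:ff' into 6 bytes without reading past the end."""
    octets = []
    pos = 0
    for i in range(6):
        # Find where the run of hex digits ends (the "endptr" position).
        end = pos
        while end < len(text) and text[end] in string.hexdigits:
            end += 1
        if end != pos + 2:
            raise ValueError("each octet must be exactly two hex digits")
        octets.append(int(text[pos:end], 16))
        # Continue from where parsing stopped, never from pos + 3.
        pos = end
        if i < 5:
            if pos >= len(text) or text[pos] != ":":
                raise ValueError("expected ':' between octets")
            pos += 1
    if pos != len(text):
        raise ValueError("trailing characters after MAC address")
    return bytes(octets)
```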

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Don Slutz <dslutz@verizon.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

tools: libxl: do not leak diskpath during local disk attach

libxl__device_disk_local_initiate_attach is assigning dls->diskpath with a
strdup of the device path. This is then passed to the callback, e.g.
parse_bootloader_result but bootloader_cleanup will not free it.

Since the callback runs within the scope of the (e)gc, the path doesn't need
to be malloc'd; a gc'd alloc will do. All other assignments to this field use
the gc.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=767295

Reported-by: Gedalya <gedalya@gedalya.net>
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

xen: arm: propagate gic's #address-cells property to dom0.

The interrupt-map property requires that the interrupt-parent node
must have both #address-cells and #interrupt-cells properties (see
ePAPR 2.4.3.1). Therefore propagate the property if it is present.

We must propagate (rather than invent our own value) since this value
is used to size fields within other properties within the tree.

ePAPR strictly speaking requires that the interrupt-parent node
always has these properties. However reality has diverged from this
and implementations will recursively search parents for #*-cells
properties. Hence we only copy if it is present.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Julien Grall <julien.grall@linaro.org>

xen: arm: configure correct dom0_gnttab_start/size

Vexpress is currently failing to boot for me with:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1 at arch/arm/mm/ioremap.c:301 __arm_ioremap_pfn_caller+0x118/0x1a4()
CPU: 0 PID: 1 Comm: swapper Tainted: G        W     3.16.0-arm-native+ #276
[<c0011e9c>] (unwind_backtrace) from [<c0010758>] (show_stack+0x10/0x14)
[<c0010758>] (show_stack) from [<c001a3ec>] (warn_slowpath_common+0x5c/0x7c)
[<c001a3ec>] (warn_slowpath_common) from [<c001a4c8>] (warn_slowpath_null+0x18/0x20)
[<c001a4c8>] (warn_slowpath_null) from [<c001488c>] (__arm_ioremap_pfn_caller+0x118/0x1a4)
[<c001488c>] (__arm_ioremap_pfn_caller) from [<c00149a0>] (__arm_ioremap+0x14/0x20)
[<c00149a0>] (__arm_ioremap) from [<c01d103c>] (gnttab_setup_auto_xlat_frames+0x30/0xdc)
[<c01d103c>] (gnttab_setup_auto_xlat_frames) from [<c0495324>] (xen_guest_init+0x19c/0x2d4)
[<c0495324>] (xen_guest_init) from [<c0492c6c>] (do_one_initcall+0xfc/0x1a4)
[<c0492c6c>] (do_one_initcall) from [<c0492d6c>] (kernel_init_freeable+0x58/0x1b4)
[<c0492d6c>] (kernel_init_freeable) from [<c039611c>] (kernel_init+0x8/0xe4)
[<c039611c>] (kernel_init) from [<c000de58>] (ret_from_fork+0x14/0x3c)
---[ end trace 3406ff24bd97382f ]---
xen:grant_table: Failed to ioremap gnttab share frames (addr=0x00000000b0000000)!

which is:
        /*
         * Don't allow RAM to be mapped - this causes problems with ARMv6+
         */
        if (WARN_ON(pfn_valid(pfn)))
                return NULL;

This makes sense since the gnttab defaults to 0xb0000000 and my dom0
is being allocated a 1:1 mapping at 0xa0000000-0xc0000000.
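
The overlap is easy to confirm with interval arithmetic. In the sketch
below the addresses come from the log above, while the region size and
the second base address are hypothetical values chosen only for
illustration.

```python
def overlaps(start, size, ram_start, ram_end):
    """True if [start, start + size) intersects [ram_start, ram_end)."""
    return start < ram_end and start + size > ram_start


# From the log: gnttab share frames at 0xb0000000, dom0 RAM mapped 1:1
# at 0xa0000000-0xc0000000.
DOM0_RAM = (0xA0000000, 0xC0000000)
GNTTAB_SIZE = 0x20000  # hypothetical size, not taken from the source

default_hits_ram = overlaps(0xB0000000, GNTTAB_SIZE, *DOM0_RAM)
# A base below the RAM window (hypothetical address) does not collide.
other_hits_ram = overlaps(0x10000000, GNTTAB_SIZE, *DOM0_RAM)
```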

I suspect this broke around the time we stopped forcing dom0 memory to be
allocated as low as possible which happened to prevent the default dom0_gnttab
region overlapping RAM.

This patch specifies an explicit dom0_gnttab base which is explicitly unused
according to the FVP model docs (although it corresponds to CS5 this isn't
wired up to anything).

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Julien Grall <julien.grall@linaro.org>

xentop: Dynamically expand some columns

Allow certain xentop columns to automatically expand as the amount
of data reported gets larger. The columns allowed to auto expand are:

NETTX(k), NETRX(k), VBD_RD, VBD_WR, VBD_RSECT, VBD_WSECT

If the -f option is used to allow full length VM names, those names will
also be aligned based on the longest name in the NAME column.

The default minimum width of all columns remains unchanged.

Signed-off-by: Markus Hauschild <Markus.Hauschild@rz.uni-regensburg.de>
Signed-off-by: Charles Arnold <carnold@suse.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

... as being more like a hypervisor extension into the guest than a
part of the tool stack.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

x86: disable emulate.c REP optimization if introspection is active

Emulation for REP instructions is optimized to perform a single
write for all repeats in the current page if possible. However,
this interferes with a memory introspection application's ability
to detect suspect behaviour, since it will cause only one
mem_event to be sent per page touched.
This patch disables the optimization, gated on introspection
being active for the domain.

Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com>

EFI: ignore EFI commandline, skip console setup when booted from GRUB

Update EFI code to completely ignore the EFI command line when booted from GRUB.
Previously it was parsed for EFI boot specific options, but these aren't used
when booted from GRUB.

Don't do EFI console or video configuration when booted by GRUB. The EFI boot
code does some console and video initialization to support native EFI boot from
the EFI boot manager or EFI shell. This initialization should not be done when
booted using GRUB.

Update EFI documentation to indicate that it describes EFI native boot, and
does not apply at all when Xen is booted using GRUB.

Signed-off-by: Roy Franz <roy.franz@linaro.org>

lzo: check for length overrun in variable length encoding

This fix ensures that we never meet an integer overflow while adding
255 while parsing a variable length encoding. It works differently from
commit 504f70b6 ("lzo: properly check for overruns") because instead of
ensuring that we don't overrun the input, which is tricky to guarantee
due to many assumptions in the code, it simply checks that the cumulated
number of 255 read cannot overflow by bounding this number.

The MAX_255_COUNT is the maximum number of times we can add 255 to a base
count without overflowing an integer. The multiply will overflow when
multiplying 255 by more than MAXINT/255. The sum will overflow earlier
depending on the base count. Since the base count is taken from a u8
and a few bits, it is safe to assume that it will always be lower than
or equal to 2*255, thus we can always prevent any overflow by accepting
two less 255 steps.
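
The bound can be checked with a little arithmetic. The sketch below
models an unsigned integer of size_bits bits (an assumption; the real
code is C) and uses hypothetical helper names. It follows the
description above rather than reproducing the kernel source.

```python
def max_255_count(size_bits):
    """Largest number of 255 steps that can be added without overflow."""
    size_max = (1 << size_bits) - 1
    # Two fewer steps than size_max // 255 leaves room for a base count
    # of up to 2 * 255.
    return size_max // 255 - 2


def decode_length(data, base, size_bits=32):
    """Decode a run length: each leading 0 byte adds 255, then one final byte."""
    zeros = 0
    while zeros < len(data) and data[zeros] == 0:
        zeros += 1
    if zeros > max_255_count(size_bits):
        raise OverflowError("too many 255 steps")
    if zeros == len(data):
        raise ValueError("truncated input")
    return base + zeros * 255 + data[zeros]
```

For any width, max_255_count() * 255 + 2 * 255 never exceeds the largest
representable value, which is the property the patch relies on.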

This patch also reduces the CPU overhead and actually increases performance
by 1.1% compared to the initial code, while the previous fix costs 3.1%
(measured on x86_64).

The fix needs to be backported to all currently supported stable kernels.

Reported-by: Willem Pinckaers <willem@lekkertech.net>
Signed-off-by: Willy Tarreau <w@1wt.eu>
[original Linux commit: 72cf9012]
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

Revert "lzo: properly check for overruns"

This reverts commit 504f70b6 ("lzo: properly check for overruns").

As analysed by Willem Pinckaers, this fix is still incomplete on
certain rare corner cases, and it is easier to restart from the
original code.

Reported-by: Willem Pinckaers <willem@lekkertech.net>
Signed-off-by: Willy Tarreau <w@1wt.eu>
[original Linux commit: af958a38]
Signed-off-by: Jan Beulich <jbeulich@suse.com>

x86/PVH: replace bogus assertion with conditional

While PVH guests currently have to start in 64-bit mode, nothing keeps
them from entering compatibility mode via a suitable ring-0 code
selector and making a hypercall from there. Fail such attempts rather
than asserting they won't happen.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

libxl: a domain can be dying but not shutdown

The shutdown code is only present if the domain is shutdown.
If we attempt to extract it from the flags from a dying but not
shutdown domain then we get values like '255' which is not a
valid LIBXL_SHUTDOWN_REASON_. We should use LIBXL_SHUTDOWN_UNKNOWN
in this case.
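
A sketch of the decision, with the flag layout written out as
assumptions: the constants below mirror what the domain info flags are
believed to look like, and SHUTDOWN_UNKNOWN is a placeholder for the
libxl value, not its actual definition.

```python
# Assumed flag layout, for illustration only.
DOMINF_DYING = 1 << 0
DOMINF_SHUTDOWN = 1 << 2
DOMINF_SHUTDOWN_SHIFT = 16
DOMINF_SHUTDOWN_MASK = 0xFF
SHUTDOWN_UNKNOWN = -1  # placeholder for the "unknown" reason


def shutdown_reason(flags):
    """Only trust the shutdown code when the shutdown flag is set."""
    if flags & DOMINF_SHUTDOWN:
        return (flags >> DOMINF_SHUTDOWN_SHIFT) & DOMINF_SHUTDOWN_MASK
    # Dying (or running) but not shut down: the code field is meaningless.
    return SHUTDOWN_UNKNOWN
```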

Signed-off-by: David Scott <dave.scott@citrix.com>
Acked-by: Rob Hoes <rob.hoes@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
[ ijc -- updated comment in libxl_types.idl to match ]

blktap: CONFIG_GCRYPT detection

Wrap make variable in () to allow correct evaluation.

This fixes broken CONFIG_GCRYPT detection which was introduced by
commit 85896a7c4dc7b6b1dba2db79dfb0ca61738a92a4 in 2012.

Signed-off-by: Martin Pohlack <mpohlack@amazon.de>
Reviewed-by: Uwe Dannowski <uwed@amazon.de>
Reviewed-by: Anthony Liguori <aliguori@amazon.com>
Reviewed-by: Matt Wilson <msw@amazon.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

tools/pygrub: Fix TOCTOU race introduced by c/s 63dcc68

In addition, use os.makedirs() which will also create intermediate directories
if they don't exist.
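
A common race-free pattern is sketched below. It assumes the race was a
check-then-create sequence, which this message does not spell out, and
ensure_dir is a hypothetical helper name rather than a pygrub function.

```python
import errno
import os


def ensure_dir(path, mode=0o700):
    """Create path (and any missing parents) without a check-then-create race."""
    try:
        # Attempt the creation directly instead of testing for existence first.
        os.makedirs(path, mode)
    except OSError as err:
        # Somebody else creating the directory in the meantime is fine;
        # anything else (or a non-directory in the way) is a real error.
        if err.errno != errno.EEXIST or not os.path.isdir(path):
            raise
```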

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Ian Campbell <Ian.Campbell@citrix.com>
CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>
CC: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
CC: Olaf Hering <olaf@aepfle.de>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

pygrub: fix non-interactive parsing of grub1 config files

Changes to handle non-numeric default attributes for grub2 caused run_grub()
to attempt to index into the images list using a string. Pull out the code
that handles submenus into a new function and use that to ensure sel is
numeric.

Reported-by: David Scott <dave.scott@citrix.com>
Signed-off-by: Simon Rowe <simon.rowe@eu.citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-and-tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

x86/HVM: only kill guest when unknown VM exit occurred in guest kernel mode

A recent KVM change by Nadav Amit <namit@cs.technion.ac.il> pointed out
that unconditional VM exits (like VMX'es ones for the INVEPT, INVVPID,
and XSETBV instructions) may result from guest user mode activity (in
the example cases, e.g. prior to a privilege level check being done).
Consequently convert the unconditional domain_crash() to a conditional
one (when guest is in kernel mode) with the alternative of injecting
#UD (when in user mode).

This is meant to be a precaution against in-guest security issues
introduced when any such VM exit becomes possible (on newer hardware)
without the hypervisor immediately being aware of it. There are no such
unhandled VM exits currently (and hence this is not an active security
issue), but old (no longer security maintained) versions exhibit issues
in the cases given as examples above.

Suggested-by: Tim Deegan <tim@xen.org>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>

VMX: values written to MSR_IA32_SYSENTER_E[IS]P should be canonical

A recent KVM change by Nadav Amit <namit@cs.technion.ac.il> helped spot
that we have the same issue as they did.
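
For reference, a canonical-address test can be sketched as below. The
48-bit virtual address width is an assumption, and is_canonical is an
illustrative helper, not the hypervisor's own macro.

```python
def is_canonical(addr, va_bits=48):
    """True if bits 63..(va_bits - 1) of a 64-bit address are all equal."""
    addr &= (1 << 64) - 1
    top = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones
```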

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>

process softirqs while dumping domains

Process softirqs once per domain, and once every 64 vcpus in a guest to avoid
being hit by the NMI watchdog. Discovered against a VM which had accidentally
been assigned 8192 vcpus.
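
The loop structure described above, as a sketch. The callbacks are
hypothetical stand-ins for the hypervisor's dump and softirq routines.

```python
def dump_domains(domains, dump_vcpu, process_softirqs):
    """domains is an iterable of per-domain vcpu sequences."""
    for vcpus in domains:
        # Once per domain.
        process_softirqs()
        for index, vcpu in enumerate(vcpus):
            # And again after every 64 vcpus within a large domain.
            if index and index % 64 == 0:
                process_softirqs()
            dump_vcpu(vcpu)
```

For a single guest with 8192 vcpus this yields 128 softirq processing
points instead of none.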

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>

vtd: correct some comments

In some cases Dom0 and the hardware domain are not the same domain.

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>

x86/HVM: sanity check xsave area when migrating or restoring from older Xen versions

Xen 4.3.0, 4.2.3 and older transferred a maximum sized xsave area (as
if all the available XCR0 bits were set); the new version only
transfers based on the actual XCR0 bits. This may result in a smaller
area if the last sections were missing (e.g., the LWP area from an AMD
machine). If the size doesn't match the XCR0 derived size, the size is
checked against the maximum size and the part of the xsave area
between the actual and maximum used size is checked for zero data. If
either the max size check fails or any part of the overflow area is
non-zero, we return with an error.
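
The acceptance rule can be sketched as below. The function name and
arguments are hypothetical, and rejecting an area smaller than the XCR0
derived size is an assumption not stated in this message.

```python
def xsave_area_acceptable(area, xcr0_size, max_size):
    """area: the incoming xsave bytes; sizes are in bytes."""
    size = len(area)
    if size == xcr0_size:
        return True
    # Assumption: an area smaller than the XCR0 derived size is rejected.
    if size < xcr0_size or size > max_size:
        return False
    # Older senders pad up to the maximum size; the padding must be zero.
    return not any(area[xcr0_size:])
```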

Signed-off-by: Don Koch <dkoch@verizon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

EFI: constify a few table pointers

We shouldn't (and don't) modify any of these tables.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

EFI: allow to suppress the use of runtime services

On certain systems some of the memory map entries designated for use by
runtime services cannot be mapped (frequently due to firmware bugs). On
others, some of the memory map entries aren't even marked for runtime
services use, yet are being used by them. For both cases give people a
way to suppress use of runtime services altogether.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86: tolerate running on EFI runtime services page tables in map_domain_page()

In the event of a #PF while in an EFI runtime service function we
otherwise can't dump the page tables, making the analysis of the
problem more cumbersome.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

hvm/load: correct length checks for zeroextended records

In the case that Xen is attempting to load a zeroextended HVM record where the
difference needing extending would overflow the data blob, _hvm_check_entry()
will incorrectly fail before working out that it would have been safe.

The "len + sizeof(*d)" check is wrong. Consider zeroextending a 16 byte
record into a 32 byte structure. "32 + hdr" will fail the overall context
length check even though the pre-extended record in the stream is 16 bytes.

The first condition is reduced to just a length check for the hvm save header,
while the second condition is extended to include a check that the record in
the stream does not exceed the stream length.
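
The corrected checks can be modelled as below. This is a sketch under
assumptions: HDR_SIZE is an assumed descriptor size, and check_entry is
an illustrative function, not _hvm_check_entry() itself.

```python
HDR_SIZE = 8  # assumed size of the save descriptor, for illustration


def check_entry(stream_size, cur, rec_len, struct_len, strict_length):
    """rec_len: length recorded in the stream; struct_len: size being loaded into."""
    remaining = stream_size - cur
    # First condition: only the header itself has to fit.
    if HDR_SIZE > remaining:
        return False
    # Second condition: the record may be shorter than the structure only
    # when zero extension is allowed.
    if strict_length:
        if rec_len != struct_len:
            return False
    elif rec_len > struct_len:
        return False
    # The record as stored in the stream must not run past the stream end.
    return rec_len <= remaining - HDR_SIZE
```

In the example from the text, a 16 byte record followed by nothing else
passes with struct_len = 32, whereas a check on "32 + hdr" would have
rejected it.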

The error messages are extended to include further useful information.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Paul Durrant <Paul.Durrant@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

vmx: fix save/restore issue with apicv

This patch fixes two issues:

1. Interrupts on PIR are lost during save/restore. Syncing the PIR
into IRR during save will fix it.

2. The EOI exit bitmap isn't set up correctly after restore. Here we
will construct the EOI exit bitmap via (IRR | ISR). Though it may cause
unnecessary EOI exits for the interrupts that are pending in IRR or ISR
during save/restore, each pending interrupt only causes one vmexit. The
subsequent interrupts will adjust the EOI exit bitmap correctly. So
the performance impact can be ignored.
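
Both steps amount to bitmap operations, sketched here with Python
integers standing in for the 256-bit vector registers. The function
names are hypothetical.

```python
VECTOR_MASK = (1 << 256) - 1  # one bit per interrupt vector


def sync_pir_to_irr(pir, irr):
    """On save: fold posted interrupts into IRR so they are not lost.

    Returns the new (pir, irr) pair.
    """
    return 0, (irr | pir) & VECTOR_MASK


def rebuild_eoi_exit_bitmap(irr, isr):
    """On restore: every vector pending in IRR or ISR gets an EOI exit."""
    return (irr | isr) & VECTOR_MASK
```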

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Olaf Hering <olaf@aepfle.de>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

xen/arm: dump guest stack even if not the current VCPU

If show_guest_stack was called from Xen context (for instance by hitting
the '0' key on the Xen console), get_page_from_gva was not able to get the
page and returned NULL.
Detecting a different domain and changing the VTTBR register makes
get_page_from_gva work for different domains.

Signed-off-by: Frediano Ziglio <frediano.ziglio@huawei.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

Add INSTALL file

Document how to use configure and what to pass to make(1).

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
[ ijc -- fix a few typos ]

tools/hotplug: add helper script to visualize systemd dependencies

A small helper to draw a graph with dot(1) and show it with display(1):
bash tools/hotplug/Linux/systemd/show_service_dependencies.sh \
tools/hotplug/Linux/systemd/*.in

A red line means Requires= aka "enable it"
A blue line means After=
A green line means Before=

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Wei Liu <wei.liu2@citrix.com>

tools/hotplug: every systemd service depends on proc-xen.mount

Every systemd service file uses /proc/xen/capabilities to check if it
runs in a dom0. Update every service file to enable proc-xen.mount with
the Requires= statement and schedule its startup with the After=
statement.
In some places var-lib-xenstored.mount is removed. This is ok because
it's optional and this unit is enabled by xenstored itself. After all it's
a private directory for xenstored.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

tools/hotplug: systemd xenstored dependencies

Everything which uses xenstored does this via the socket. Update the
existing service files to enable the xenstored.socket with the Requires=
statement. And schedule startup of the given service files after the
socket is enabled with the After= statement.
Once something tries to access the socket systemd will launch xenstored.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

tools/hotplug: xendomains now depends on xen-init-dom0

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>

tools/hotplug: add systemd xen-init-dom0 service

Also prevent xenstored.service from writing Dom0 nodes. The
initialisation is now done with xen-init-dom0.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
[ ijc -- ran autogen.sh as requested ]

tools/hotplug: fix clean target in systemd Makefile

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>

tools/hotplug: fix conditions in systemd service files for dom0

ConditionVirtualization= checks if systemd runs in some sort of guest.
It is not supposed to detect host capabilities. The current
implementation happens to work because systemd-detect-virt from v208
also returns 'xen' in a dom0. In v210 and later 'none' is returned and
no service files will be started.

Adjust the checks to detect a dom0 vs. native boot. Mounting xenfs
depends on /proc/xen, but should only be done for pvops because xenfs
exists only there. All other service files should not be started in
domU. The file /proc/xen/capabilities exists in both dom0 and domU in a
pvops kernel, but only in dom0 does it contain 'control_d'. The existing
ExecStartPre= check will prevent starting in a domU.
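
The dom0 test described above boils down to a substring check on that
file. A sketch, with is_dom0 as a hypothetical helper (the service
files themselves do this check from ExecStartPre=, not from Python):

```python
def is_dom0(path="/proc/xen/capabilities"):
    """True only if the capabilities file exists and mentions control_d."""
    try:
        with open(path) as handle:
            return "control_d" in handle.read()
    except OSError:
        # No /proc/xen at all: native boot, or xenfs not mounted.
        return False
```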

ConditionVirtualization=!xen is true in a dom0. But this check is broken
in systemd v208, so it's not used.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

tools/libxl: Fix building against libxl for LIBXL_API_VERSION < 0x040500

c/s 6276f66ebe "libxl: libxl_uuid_copy now takes a ctx argument" introduces
API compatibility for libxl_uuid_copy() which sadly is not valid C. Fix it.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Ian Campbell <Ian.Campbell@citrix.com>
CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

stubdom/Makefile: use QEMU_TRADITIONAL_LOC

In commit 8962a8f951ea83e8d10ee23aeb20266e4795b06e CONFIG_QEMU was
replaced by QEMU_TRADITIONAL_LOC. However stubdom/Makefile still uses
CONFIG_QEMU so building stubdom is likely to fail. This patch
replaces CONFIG_QEMU with QEMU_TRADITIONAL_LOC in stubdom/Makefile as
well.

Signed-off-by: Michael Young <m.a.young@durham.ac.uk>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Merge branch 'release-4.5.0-rc1' into staging

xen/Makefile: Update version to 4.5.0-rc

* Remove the rc number as this makes rc releases more convenient.
* Add the .0, since we conventionally call our actual releases things
like `4.4.0'.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Release-Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Xen 4.5.0-rc1: Update tag for both QEMU trees.

And change 'unstable' to 'rc1'.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

tools/libxl: Fix libxl_list_vcpu() following c/s 93e52d52

My reasoning regarding nr_cpus_out was wrong, as I had confused nr_cpus_out
with nr_vcpus_out.

Dario pointed this out, but the patch (having gained appropriate acks) got
committed before I could post a correction.

Noticed-by: Dario Faggioli <dario.faggioli@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Ian Campbell <Ian.Campbell@citrix.com>
CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

README: mention git repos at xen.org

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

README: remove requirement for udev

There is no requirement for udev during package build. It may be
required at runtime even with libxl with run_hotplug_scripts=yes
in xl.conf.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

x86: hvm: Allow configuration of the size of the mmio_hole.

If you add enough PCI devices then all mmio may not fit below 4G
which may not be the layout the user wanted. This allows you to
increase the below 4G address space that PCI devices can use and
therefore in more cases not have any mmio that is above 4G.

There are real PCI cards that do not support mmio over 4G, so if you
want to emulate them precisely, you may also need to increase the
space below 4G for them. There are drivers for these cards that also
do not work if they have their mmio space mapped above 4G.

This allows growing the MMIO hole to the size needed.

This may help with using pci passthru and HVM.

In the tools this is named mmio_hole_memkb.

Signed-off-by: Don Slutz <dslutz@verizon.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
[ ijc -- fixed build error in xl_cmdimpl.c with s/%PRIu64/%ld/.
Reworded title ]

Revert "xen: introduce arch_grant_(un)map_page_identity"

This reverts commit e25a5f4d8cf3b55718048abdd21c7d0de64ae54c.

With the removal of XENFEAT_grant_map_identity, this commit is not
needed anymore.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Reviewed-by: Julien Grall <julien.grall@linaro.org>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

Revert "xen/arm: introduce XENFEAT_grant_map_identity"

Revert commit id 8d09ef6906ca0a9957e21334ad2c3eed626abe63.
Just keep the definition of XENFEAT_grant_map_identity.

XENFEAT_grant_map_identity is superseded by GNTTABOP_cache_flush. As
XENFEAT_grant_map_identity causes additional tlb flushes, it is best to
remove the feature entirely now.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Reviewed-by: Julien Grall <julien.grall@linaro.org>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

introduce GNTTABOP_cache_flush

Introduce a new hypercall to perform cache maintenance operation on
behalf of the guest. The argument is a machine address and a size. The
implementation checks that the memory range is owned by the guest or the
guest has been granted access to it by another domain.

Introduce grant_map_exists: an internal grant table function to check
whether an mfn has been granted to a given domain on a target grant
table. Check hypercall_preempt_check() every 4096 iterations in the
implementation of grant_map_exists.
Use the top 20 bits of the GNTTABOP cmd encoding to save the last ref
across the hypercall continuation.
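
The continuation encoding can be sketched as follows, assuming a 32-bit
cmd word whose low 12 bits hold the operation and whose top 20 bits
hold the saved ref. The constant and function names are illustrative.

```python
CMD_BITS = 12                      # assumed: 32-bit cmd, top 20 bits spare
CMD_MASK = (1 << CMD_BITS) - 1
REF_LIMIT = 1 << 20                # what fits in the top 20 bits


def encode_continuation(cmd, last_ref):
    """Pack the op number and the last processed ref into one cmd word."""
    if not 0 <= cmd <= CMD_MASK or not 0 <= last_ref < REF_LIMIT:
        raise ValueError("cmd or ref out of range")
    return (last_ref << CMD_BITS) | cmd


def decode_continuation(word):
    """Recover (cmd, last_ref) when the hypercall is restarted."""
    return word & CMD_MASK, word >> CMD_BITS
```

A first invocation carries last_ref = 0, so the encoded word is simply
the plain command number.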

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86: introduce more cache maintenance operations

Move the existing flush_page_to_ram to flushtlb.h.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

arm: introduce invalidate_dcache_va_range

Take care of handling non-cacheline aligned addresses and sizes.
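
One way to handle the unaligned ends is sketched below: partial lines
at either end are cleaned as well as invalidated so that unrelated data
sharing those lines is not discarded, while fully covered lines are
simply invalidated. That policy and the 64-byte line size are
assumptions for illustration; the function only plans the operations.

```python
def plan_invalidate(addr, size, line=64):
    """Return the list of (operation, line_address) pairs for a VA range."""
    ops = []
    mask = line - 1
    end = addr + size
    if addr & mask:
        # Partial first line: must not lose neighbouring data.
        addr &= ~mask
        ops.append(("clean+invalidate", addr))
        addr += line
    if end & mask:
        # Partial last line: same treatment.
        end &= ~mask
        ops.append(("clean+invalidate", end))
    while addr < end:
        ops.append(("invalidate", addr))
        addr += line
    return ops
```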

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Reviewed-by: Julien Grall <julien.grall@linaro.org>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

arm: return int from *_dcache_va_range

These functions cannot really fail on ARM, but their x86 equivalents can
(-EOPNOTSUPP). Change the prototype to return int.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Reviewed-by: Julien Grall <julien.grall@linaro.org>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

arm: rename *_xen_dcache_* operations to *_dcache_*

Given that we are in Xen, it is obvious that these are Xen flushes.
Also the corresponding x86 functions are going to be named without
_xen_, so remove it here for consistency.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Reviewed-by: Julien Grall <julien.grall@linaro.org>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

introduce gnttab_max_frames and gnttab_max_maptrack_frames command line options

Introduce gnttab_max_maptrack_frames: a new Xen command line option to
specify the max number of maptrack frames per domain.
Deprecate the old gnttab_max_nr_frames and introduce gnttab_max_frames
instead, that doesn't affect the maptrack. Keep gnttab_max_nr_frames for
compatibility.

Rename internally max_nr_grant_frames to max_grant_frames to avoid
confusion.

Introduce DEFAULT_MAX_MAPTRACK_FRAMES, that is completely independent
from max_nr_grant_frames.

Remove MAX_MAPTRACK_TO_GRANTS_RATIO that is now only used in one place
for compatibility.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

docs: EFI configuration file must be ASCII type

Currently Xen can only read the configuration file if it is in ASCII
format. If it is in CHAR16 or CHAR8 it will choke. One way to verify
this is to use 'file':

xen.cfg: ASCII text
xen-char16.cfg: Little-endian UTF-16 Unicode text, with CRLF, CR line terminators

The latter is no good.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

EFI: fix efi_arch_allocate_mmap_buffer() to return new size

efi_arch_allocate_mmap_buffer() allocates a buffer for the EFI memory map, and
for ARM it allocates a larger buffer than requested. This is done to account
for the increase in the map size that may occur when the allocation is made.
The previous code allocated a larger buffer, but did not adjust the size to
match. This caused the later call to GetMemoryMap() to fail with a
BUFFER_TOO_SMALL error, since the original, smaller size was used. This patch
changes the argument to efi_arch_allocate_mmap_buffer() to be a pointer to
UINTN, and the ARM version updates the size on a successful allocation.
The x86 version uses a different allocation method, so only the function
argument type is changed.
Also add decode of the BUFFER_TOO_SMALL error code to PrintErrMesg().

Signed-off-by: Roy Franz <roy.franz@linaro.org>
Acked-by: Ian Campbell <ian.campbell@citrix.com> [ARM]
Acked-by: Jan Beulich <jbeulich@suse.com> [non-ARM]

x86/boot: add memory to clobber list in reloc_mbi_struct()

The inline assembly in reloc_mbi_struct() clobbers
memory, so tell the compiler about that.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86/boot: use constant in head.S instead of hardcoded value

..to access multiboot.mem_lower data.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86/boot: fix reloc.S build dependencies

reloc.S is not rebuilt if a header included
in reloc.c is updated. Fix this issue.

Additionally, remove the reloc.S build dependency
on head.S because nothing in reloc.S
depends on head.S.

Add reloc.c dependency to reloc.o build rule for consistency.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

correct the documentation of where the Xen cpuid leaves can be found

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

fix listing of vcpus when domains lacking any vcpus exist

On a system which looks like this:

[root@st04 ~]# xl list
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0   752     4     r-----   46699.3
(null)                                       1     0     0     --p---       0.0
(null)                                       2     0     0     --p---       0.0
(null)                                       3     0     0     --p---       0.0
badger                                      25     0     1     --p---       0.0

`xl vcpu-list` fails as follows:

[root@st04 ~]# xl vcpu-list
Name                                ID  VCPU   CPU State   Time(s) CPU Affinity
Domain-0                             0     0    0   -b-   12171.0  all
Domain-0                             0     1    1   -b-   11779.6  all
Domain-0                             0     2    2   -b-   11599.0  all
Domain-0                             0     3    3   r--   11007.0  all
libxl: critical: libxl__calloc: libxl: FATAL ERROR: memory allocation failure (libxl__calloc, 4294935299 x 40)
: Cannot allocate memory
libxl: FATAL ERROR: memory allocation failure (libxl__calloc, 4294935299 x 40)

The root cause of this is in Xen.  getdomaininfo() has no way of expressing
"this domain has no vcpus".  Previously, info->max_vcpu_id would be returned
uninitialised in such a case.

Unfortunately, setting it to 0 as a default is not appropriate.  A max_vcpu_id
of 0 and nr_online_cpus of 0 is the valid state for a single vcpu domain which
is in the process of being destroyed.

As all components are required to add 1 to max_vcpu_id to get the number of
vcpus, an id of ~0U is not valid to be used.  Explicitly define this as an
invalid max vcpu value, and use it to express "no vcpus" in getdomaininfo().

In libxl, the issue is seen as libxl_list_vcpu() attempts to use the
uninitialised domaininfo.max_vcpu_id for memory allocation.

Check domaininfo.max_vcpu_id against the new sentinel value
XEN_INVALID_MAX_VCPU_ID, and return early.  This means that it is now valid
for libxl_list_vcpu() to return NULL for a domain which lacks any vcpus.

As part of this change, remove the pointless call to libxl_get_max_cpus(),
whose returned value is unconditionally clobbered in the for() loop.
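
The early-return check can be sketched as below. This is a simplified model
rather than the libxl source: list_vcpus() and struct vcpu_info are
hypothetical stand-ins, and only the sentinel name and its ~0U value come from
the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Sentinel meaning "this domain has no vcpus" (value taken from the text). */
#define XEN_INVALID_MAX_VCPU_ID (~0U)

/* Hypothetical, trimmed-down per-vcpu record. */
struct vcpu_info {
    uint32_t vcpu_id;
};

/*
 * Return an array of max_vcpu_id + 1 records, or NULL with *nr_out == 0
 * when the domain has no vcpus (or when allocation fails).
 */
static struct vcpu_info *list_vcpus(uint32_t max_vcpu_id, unsigned int *nr_out)
{
    assert(nr_out != NULL);
    *nr_out = 0;

    /* Adding 1 to the sentinel would wrap to 0; bail out before sizing. */
    if (max_vcpu_id == XEN_INVALID_MAX_VCPU_ID)
        return NULL;

    unsigned int nr = max_vcpu_id + 1;
    struct vcpu_info *v = calloc(nr, sizeof(*v));

    if (v == NULL)
        return NULL;

    for (unsigned int i = 0; i < nr; i++)
        v[i].vcpu_id = i;

    *nr_out = nr;
    return v;
}
```

Callers therefore have to treat a NULL return with a zero count as a valid
"no vcpus" answer rather than as an error.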

Reported-by: Euan Harris <euan.harris@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Don Slutz <dslutz@verizon.com>
Acked-by: Ian Campbell <Ian.Campbell@citrix.com>

x86/setup: correct register clobbers for the asm statement when resyncing the stack

When resyncing the stack, the asm statement does not identify %rsi, %rdi and
%rcx as clobbered by the 'rep movsq'.

Luckily, there are no functional problems in the generated code. GCC decides
not to save any of them before calling bootstrap_map(), which clobbers them.

Correct the clobbers, by listing them as earlyclobber discarded outputs.
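
A minimal sketch of the "earlyclobber discarded outputs" idiom, assuming a
GCC-compatible compiler on x86-64 and a clear direction flag (as the ABI
requires). copy_qwords() is a hypothetical helper, not the code in setup.c;
other targets fall back to memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy nr_qwords 8-byte words from src to dst. */
static void copy_qwords(void *dst, const void *src, size_t nr_qwords)
{
    assert(dst != NULL && src != NULL);
#if defined(__GNUC__) && defined(__x86_64__)
    uint64_t discard_si, discard_di, discard_cx;

    /*
     * 'rep movsq' modifies %rsi, %rdi and %rcx.  Registers used as inputs
     * may not also be listed as clobbers, so each one is declared as an
     * earlyclobber output ("=&") tied to its input by a matching constraint,
     * and the resulting values are simply thrown away.
     */
    __asm__ __volatile__ ( "rep movsq"
                           : "=&S" (discard_si), "=&D" (discard_di),
                             "=&c" (discard_cx)
                           : "0" ((uint64_t)(uintptr_t)src),
                             "1" ((uint64_t)(uintptr_t)dst),
                             "2" ((uint64_t)nr_qwords)
                           : "memory" );
    (void)discard_si;
    (void)discard_di;
    (void)discard_cx;
#else
    memcpy(dst, src, nr_qwords * sizeof(uint64_t));
#endif
}
```

The "memory" clobber is still needed because the copy writes through dst.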

Reported-by: Daniel Kiper <daniel.kiper@oracle.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
Tested-by: Daniel Kiper <daniel.kiper@oracle.com>

x86/hvm: further restrict access to x2apic MSRs

The x2apic specification reserves the entire MSR range 0x800-0xbff, while only
the first 0x3f MSRs have defined purposes.  All reserved MSRs in this region
are architecturally required to raise #GP faults upon access.

Xen used to pass this entire range to hvm_x2apic_msr_{read,write}(), but the
range was restricted somewhat by XSA-108 (c/s 61fdda7ac) to prevent guests
being able to read pages adjacent to the domheap page backing the vlapic->regs
array.

While removing the vulnerability, a side effect of XSA-108 was that the MSR
range 0x900-0xbff fell through the switch statement and ended up reading the
host's x2apic range. This behaviour is a problem in general, but specifically
it turns out that MSRs 0xa00-0xa02 are implemented (but undocumented) on
certain SandyBridge and IvyBridge systems.

Experimentally, no operating system in XenServer's test suite (including all
versions of Windows currently supported by Microsoft) ever peeks at these MSRs,
even on hosts where some of them are implemented.

This patch undoes the fix for XSA-108 (c/s 61fdda7ac), returning the primary
bounds check to the entire specified range.  hvm_x2apic_msr_write() was always
safe, as it is whitelist based.  hvm_x2apic_msr_read() changes to a whitelist
approach, which avoids the vulnerability, and provides a more architecturally
accurate emulation of the reserved MSRs (which would previously read as 0
rather than fault).
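
The whitelist approach for reads can be sketched as follows. The function
name, the result enum and the three whitelisted registers are illustrative
assumptions; the real whitelist covers more registers and reads values from
the virtual LAPIC state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define X2APIC_MSR_FIRST 0x800u
#define X2APIC_MSR_LAST  0xbffu

enum msr_result {
    MSR_OK,          /* read handled, *val is valid */
    MSR_GP_FAULT,    /* reserved x2apic MSR: inject #GP */
    MSR_NOT_X2APIC   /* outside 0x800-0xbff: not ours to handle */
};

static enum msr_result x2apic_msr_read(uint32_t msr, uint64_t *val)
{
    assert(val != NULL);

    /* Primary bounds check covers the entire architectural range. */
    if (msr < X2APIC_MSR_FIRST || msr > X2APIC_MSR_LAST)
        return MSR_NOT_X2APIC;

    switch (msr) {
    case 0x802: /* APIC ID */
    case 0x803: /* APIC version */
    case 0x808: /* Task priority */
        *val = 0;   /* placeholder for a read of the virtual register */
        return MSR_OK;

    default:
        /* Anything not explicitly whitelisted faults, never hits hardware. */
        return MSR_GP_FAULT;
    }
}
```

With this shape a read of 0xa00 yields a fault instead of falling through to
the host MSR.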

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Jun Nakajima <jun.nakajima@intel.com>

x86: define cmdline_cook() loader_name argument as a const

The loader_name argument of cmdline_cook() is not modified, so
define it as const.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

clean target should remove xen.efi binary

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

x86/viridian: freeze time reference counter when domain is paused

In XenServer system test it has become apparent that versions of Windows
that make use of the time reference counter enlightenment cannot cope with
large jumps forward in the value read from the MSR. Specifically,
suspending a very large domain took approx. 45 minutes to complete and
when the domain was resumed it was discovered that the WMI (Windows
Management Instrumentation) service had hung.

The reason a large jump forward is seen by the guest is that, when a guest
is suspended, the guest stops running when the SCHEDOP_suspend hypercall is
made, however the MSR value essentially keeps incrementing until the
tool-stack issues DOMCTL_gethvmcontext.

This patch adds code to freeze the value of the time reference counter
on domain pause and 'thaw' it on domain unpause, but only thaw it if the
domain is not shutting down. The absolute value of the counter is then
saved in the viridian domain context record. This prevents the guest OS
from experiencing large jumps in the value of the MSR and has been shown
to reliably fix the problem with WMI.
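
The freeze/thaw behaviour can be modelled as below. This is a simplified
sketch and not the viridian code: the structure, the function names and the
explicit "now" parameter (standing in for the underlying clock) are all
assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct time_ref_count {
    bool     frozen;
    uint64_t offset;   /* added to the clock while running (mod 2^64) */
    uint64_t value;    /* counter value captured when frozen */
};

/* Value the guest would read from the MSR at clock time 'now'. */
static uint64_t trc_read(const struct time_ref_count *t, uint64_t now)
{
    assert(t != NULL);
    return t->frozen ? t->value : now + t->offset;
}

/* Domain pause: latch the current counter value. */
static void trc_freeze(struct time_ref_count *t, uint64_t now)
{
    assert(t != NULL);
    if (!t->frozen) {
        t->value = now + t->offset;
        t->frozen = true;
    }
}

/* Domain unpause: resume from the latched value unless shutting down. */
static void trc_thaw(struct time_ref_count *t, uint64_t now, bool shutting_down)
{
    assert(t != NULL);
    if (shutting_down || !t->frozen)
        return;
    t->offset = t->value - now;   /* hide the time spent paused */
    t->frozen = false;
}
```

Because the counter stays frozen while the domain is shutting down for
suspend, the value saved in the context record is the one the guest last saw.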

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

xen/arm64: Use __flush_dcache_area instead of __flush_dcache_all

When booting with EFI, __flush_dcache_all does not correctly flush data.
According to Mark Rutland, __flush_dcache_all is not guaranteed to push
data to the PoC if there is a system-level cache as it uses Set/Way
operations. Therefore, this patch switches to the "__flush_dcache_area"
mechanism, which is copied from Linux.
Add flushing of FDT in addition to Xen text/data.
Remove now unused __flush_dcache_all and related helper functions.
Invalidate the instruction TLB before turning on paging
later on when starting Xen in EL2.
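
The difference from Set/Way maintenance is that a by-address flush walks the
virtual address range one cache line at a time. The loop structure can be
sketched in portable C as below; for_each_cache_line() and the callback type
are hypothetical, and on arm64 the callback would issue a by-VA clean such as
DC CVAC or DC CIVAC, followed by a barrier once the loop completes. The real
routine is written in assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Operation applied to each cache line; ctx is passed through unchanged. */
typedef void (*line_op_t)(uintptr_t line_addr, void *ctx);

/*
 * Visit every cache line overlapping [start, start + len) and return the
 * number of lines visited.  line_size must be a power of two.  The range is
 * assumed not to wrap around the end of the address space.
 */
static size_t for_each_cache_line(uintptr_t start, size_t len,
                                  size_t line_size, line_op_t op, void *ctx)
{
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);

    if (len == 0)
        return 0;

    uintptr_t addr = start & ~(uintptr_t)(line_size - 1);   /* align down */
    uintptr_t end = start + len;
    size_t visited = 0;

    for (; addr < end; addr += line_size) {
        if (op != NULL)
            op(addr, ctx);
        visited++;
    }

    return visited;
}
```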

Signed-off-by: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Signed-off-by: Roy Franz <roy.franz@linaro.org>
Acked-by: Ian Campbell <ian.campbell@citrix.com>